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Abstract 

We introduce a novel algorithm that computes the fc-sparse principal component of a positive semidefinite 
matrix A. Our algorithm is combinatorial and operates by examining a discrete set of special vectors lying in 
a low-dimensional eigen-subspace of A. We obtain provable approximation guarantees that depend on the 
spectral profile of the matrix: the faster the eigenvalue decay, the better the quality of our approximation. 
For example, if the eigenvalues of A follow a power-law decay, we obtain a polynomial-time approximation 
algorithm for any desired accuracy. 

We implement our algorithm and test it on multiple artificial and real data sets. Due to a feature elimination 
step, it is possible to perform sparse PCA on data sets consisting of millions of entries in a few minutes. 
Our experimental evaluation shows that our scheme is nearly optimal while finding very sparse vectors. We 
^—t compare to the prior state of the art and show that our scheme matches or outperforms previous algorithms 

in all tested data sets. 



> 1 Introduction 

Principal component analysis (PCA) reduces the dimensionality of a data set by projecting it onto principal 
subspaces spanned by the leading eigenvectors of the sample covariance matrix. The statistical significance of 
PCA partially lies in the fact that the principal components capture the largest possible data variance. The first 
principal component (i.e., the first eigenvector) of an n x n matrix A is the solution to 

arg max x Ax 

T~ "H I] I] 2=1 

where A = SS T and S is the nx m data-set matrix consisting of m data-points, or entries, each evaluated on n 
features, and || x|| 2 is the 1% -norm of x. PCA can be efficiently computed using the singular value decomposition 
(SVD). The statistical properties and computational tractability of PCA renders it one of the most used tools in 
data analysis and clustering applications. 

A drawback of PCA is that the generated vectors typically have very few zero entries, i.e., they are not 
sparse. Sparsity is desirable when we aim for interpretability in the analysis of principal components. An exam- 
ple where sparsity implies interpretability is document analysis, where principal components can be used to 
cluster documents and detect trends. When the principal components are sparse, they can be easily mapped to 
topics (e.g., newspaper article classification into politics, sports, etc.) using the few keywords in their support 
< |Gawalt et aL||Zhang & El Ghaoui}|20TT] |. For that reason it is desirable to find sparse eigenvectors. 



Sparse PCA. Sparsity can be directly enforced in the principal components. The sparse principal compo- 
nent is defined as 

2;* = arg max x T Ax. (1) 

||z||2 = l|||&||o— 

The £ cardinality constraint limits the optimization over vectors with k non-zero entries. As expected, sparsity 
comes at a cost since the optimization in Q is NP-hard (Moghaddam et al. 2006a) and hence computationally 
intractable in general. 



1 



Our Contribution. We introduce a novel algorithm for sparse PCA that has a provable approximation 
guarantee. Our algorithm generates a fc-sparse, unit length vector x d that gives an objective provably within a 
1 — e<2 factor from the optimal: 

x T d Ax d > (1 - e d ) x*Ax* 




, . I n Xd+i .„-,-, 1 
£d<mm — , — nr }, (2) 



with 

fj < min < 

^ fc Ai 

where A, is the ith largest eigenvalue of ^4 and is the maximum diagonal element of A. For any desired 
value of the parameter d, our algorithm runs in time 0(n d+1 logn). Our approximation guarantee is directly 
related to the spectrum of A: the greater the eigenvalue decay the better the approximation. Equation Q 
contains two bounds: one that uses the largest eigenvalue Ai and one that uses the largest diagonal element of 
A, Aj 1 '. Either bound can be tighter, depending on the structure of the A matrix. 

We subsequently rely on our approximation result to establish guarantees for considerably general families 
of matrices. 

Constant-factor approximation. If we only assume that there is an arbitrary decay in the eigenvalues of A, 
i.e., there exists a constant d = 0(1) such that Xi > Ad+i, then we can obtain a constant-factor approximation 
guarantee for the linear sparsity regime. Specifically we find a constant 60 such that for all sparsity levels 
k > Sq n we obtain a constant approximation ratio for sparse PCA, partially solving the open problem discussed 



in ( |Zhang et al. 2012 |d'Aspremont et aL||2012) . This result easily follows from our main theorem. 



Eigenvalue Power-law Decay. When the data matrix spectrum exhibits a power-law decay, we can obtain a 
much stronger performance guarantee: we can solve sparse PCA for any desired accuracy e in time polynomial 
in n, k (but not in 1). This is sometimes called a polynomial-time approximation scheme (PTAS). Further, the 
power-law decay is not necessary: the spectrum does not have to follow exactly that decay, but only exhibit a 
substantial spectral drop after a few eigenvalues. 

Our algorithm operates by scanning a low-dimensional subspace of A. There, it examines a polynomial 
number of special vectors, that lead to a sparse principal component which admits provable performance. A 
key conceptual innovation that we employ is a hyperspherical transformation on our problem space to reduce 
its dimensionality. Another important component of our scheme is a safe feature elimination step that allows 
the scalability of our algorithm for data-sets with millions of entries. We introduce a test that discards features 
that are provably not in the support of the sparse PC, in a similar manner as (Zhang & El Ghaoui 2011) , but 
using a different combinatorial criterion. 

Experimental Evaluation. We evaluate and compare our algorithm against state of the art sparse PCA 
approaches on synthetic and real data sets. Our real data-set is a large Twitter collection of more than 10 
million tweets spanning approximately six months. We executed several experiments on various subsets of 
our data set: collections of tweets during a specific time-window, tweets that contained a specific word, etc. 
Our implementation executes in less than one second for 50k — lOOfc documents and in a few minutes for 
millions of documents, on a personal computer. Our scheme typically comes closer than 90% of the optimal 
performance, even for d < 3, and empirically outperforms previously proposed sparse PCA algorithms. 



1.1 Related Work 



There has been a substantial volume of prior work on sparse PCA. Initial heuristic approaches used factor 
rotation techniques and thresholding of eigenvectors to obtain sparsity (Kaiser 1958 Jolliffe} 1995 Cadima & 



Jolliffe 



1995) . Then, a modified PCA technique based on the LASSO (SCoTLASS) was introduced in ( [Jolliffe 
et al. |2003) . In ( |Zou et aL 2006 1, a nonconvex regression-type approximation, penalized a la LASSO was used 
to produce sparse PCs. A nonconvex technique was presented in ( |Sriperumbudur et al. |2007) . In (jMoghad 



dam et aL} |2006b) , the authors used spectral arguments to motivate a greedy branch-and-bound approach, 
further explored in ([Moghadd am et aL] 2007) . In ( |Shen & Huang} [2008) , a similar technique to SVD was used 



employing sparsity penalties on each round of projections. A significant body of work based on semidefinite 
programming (SDP) approaches was established in ( d'Aspremont et al.} 2007a||Zhang et al. 2012 |d'Aspremont 
et al. 2008 1. A variation of the power method was used in ([Journee et aL 2010). When comp uting multiple 



PCs, the issue of deflation arises as discussed in (Mackey. 2009|. In (Amini & Wainwright |2008) , the first 
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theoretical optimality guarantees were established for thresholding and the SDP relaxation of < |d'Aspremont| 
et al. 2007a I, in the high-dimensional setting of a generative model where the covariance has one sparse eigen- 
vector. In ( |Yuan & Zhang 20111 1, the authors introduced a very efficient sparse PCA approximation based on 
truncating the well-known power method to obtain the exact level of sparsity desired, which came along with 
peformance guarantees for a specific data model. An algorithm that solves sparse PCA for constant rank matri- 



ces in pol ynomial time was introduced in ( jAsteris et al. 2011 1. The main algorithmic differences from ( |Asteris 



et al. 2011 1 are i) our solver speeds up calculations for matrices with nonnegative entries by a 2 factor in 



running complexity and ii) a safe feature elimination step is introduced that is fundamental in implementing 
the algorithm for large data-sets. 

Despite this extensive literature, to the best of our knowledge, there are very few provable approximation 
guarantees for sparse PCA algorithms and usually under limited data models < |Amini & Wainwright] |2008| 



Yuan & Zhang] |2011{ |d'Aspremont et al.} |2012) 



2 Sparse PCA through Low-rank Approximations 
2.1 Proposed Algorithm 

Our algorithm is technically involved and for that reason we start with a high-level informal description. For 
any given accuracy parameter d we follow the following steps: 

Step 1: Obtain Ad, a rank-d approximation of A. 
We obtain A d , the best-fit rank-d approximation of A, by keeping the first d terms of its eigen-decomposition: 

d 

A d = y]XiVivf, 

8=1 

where Aj is the i-th largest eigenvalue of A and Vi the corresponding eigenvector. 

Step 2: Use Ad to obtain 0(n d ) candidate supports. 
For any matrix A, we can exhaustively search for the optimal by checking all (?) possible k x k subma- 
trices of A: x* is the /c-sparse vector with the same support as the submatrix of A with the maximum largest 
eigenvalue. However, we show how sparse PCA can be efficiently solved on Ad if the rank d is constant with 
respect to n. The key technical fact that we prove is that there are only 0(n d ) candidate supports that need to 
be examined. Specifically, we show that a set of candidate supports Sd = {Ii, ■ ■ ■ ,It}, where X t is a subset of k 
indices from {1, . . . , n}, contains the optimal support. We prove that the number of these supports is^ 

K.I <>-(;)■ 

The above set S d is efficiently created by our Spannogram algorithm described in the next subsection. 

Step 3: Check each candidate support from Sd on A. 
For a given support I it is easy to find the best vector supported on I: it is the leading eigenvector of the 
principal submatrix of A, with rows and columns indexed by I. In this step, we check all the supports in Sd 
on the original matrix A and output the best. Specifically, define A% to be the zeroed-out version of A, except on 
the support X. That is, A% is an n x n matrix with zeros everywhere except for the principal submatrix indexed 
by I. and j € X, then A% = Aij, else it is 0. Then, for any Ax matrix, with I e Sd, we compute its largest 

eigenvalue and corresponding eigenvector. 

Output: 

Finally, we output the fc-sparse vector Xd that is the principal eigenvector of the A% matrix, I E Sd, with the 
largest maximum eigenvalue. We refer to this approximate sparse PC solution as the rank-d optimal solution. 

The exact steps of our algorithm are given in the pseudocode tables denoted as Algorithm 1 and 2. The 
spannogram subroutine, i.e., Algorithm 2, computes the T candidate supports in Sd, and is presented and 
explained in Section [3] The complexity of our algorithm is equal to calculating d leading eigenvectors of A 
(0(dn 2 )), running our spannogram algorithm {0(n d+1 logn)), and finding the leading eigenvector of 0(n d ) 
matrices of size k x k (0{n d k 2 )). Hence, the total complexity is 0(n d+1 \ogn + n d k 2 + dn 2 ). 

1 ln fact, in our proof we show a better dependency on d, which however has a more complicated expression. 
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Algorithm 1 Sparse PCA via a rank-d approximation 



1: Input: k, d, A 

2: p <— 1 if A has nonnegative entries, if mixed 

4: Ad <— feature_elimination(A:,p, Ad) 
5: Sd <— Spannogram^fc,p, 
6: for each I e 5^ do 
7: Calculate Ai 
8: end for 

9: 2^ pt = argmax /e5(i X X (A X ) 

10: OPT d = Ai(A J o P «) 

11: x^ pt <- the principal eigenvector of A x o P t. 

d 

12: Output: a£ pt 



Elimination Step: 

This step is run before Step 2. By using a feature elimination subroutine we can identify that certain variables 
provably cannot be in the support of xg, the rank-e? optimal sparse PC. We have a test which is related to the 
norms of the rows of Vd that identifies which of the n rows cannot be in the optimal support. We use this step 
to further reduce the number of candidate supports \Sd\- The elimination algorithm is very important when it 
comes to large scale data sets. Without the elimination step, even the rank-2 version of the algorithm becomes 
intractable for n > 10 4 . However, after running the subroutine we empirically observe that even for n that is 
in the orders of 10 6 the elimination strips down the number of features to only around 50 — 100 for values of k 
around 10. This subroutine is presented in detail in the Appendix. 



2.2 Approximation Guarantees 

The desired sparse PC is 



arg max x 

iMl2=l,||a:||o=fe 



T Ax. 



We instead obtain the /c -sparse, unit length vector Xd which gives an objective 

x^Axd — max A(Ai). 

We measure the quality of our approximation using the standard approximation factor: 

pa _x T d A Xd _ ^l X{Al \ 
xJAx* ^( fe ) 

where A^ = xj Ax* is the fc-sparse largest eigenvalue of A^ Clearly, pd < 1 and as it approaches 1, the approx- 
imation becomes tighter. Our main result follows: 

Theorem 1. For any d, our algorithm outputs Xd, where \ \xd\\o=k, \\Xd\\i=l and 

XdAxd > (1 - e)xjAx*, 

with an error bound 

, J n X d+ i X d+ i \ 

ed - min \k^^yj- 

2 Notice that the fc-sparse largest eigenvalue of A for k = 1 denoted by A^ 1 ' is simply the largest element on the diagonal of A. 
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Proof. The proof can be found in the Appendix. The main idea is that we obtain i) an upper bound on the 

( k) 

performance loss using Ad instead of A and li) a lower bound for A^ . □ 
We now use our main theorem to provide the following model specific approximation results. 

Corollary 1. Assume that for some constant value d, there is an eigenvalue decay \\ > \d+i in A. Then there exists a 
constant So such that for all sparsity levels k > Squ we obtain a constant approximation ratio. 

Corollary 2. Assume that the first d+l eigenvalues of A follow a power-law decay, i.e., \ t = Ci~ a ,for some C, a > 0. 
Then, for any k — 8n and any e> Owe can get a (1 — e)-approximate solution Xd in time O {ji 1 /^ 5 ) a + 1 logn). 

The above corollaries can be established by plugging in the values for A^ in the error bound. We find the 
above families of matrices interesting, because in practical data sets (like the ones we tested), we observe a 
significant decay in the first eigenvalues of A which in many cases follows a power law. The main point of the 
above approximability result is that any matrix with decent decay in the spectrum endows a good sparse PCA 
approximation. 

3 The Spannogram Algorithm 

In this section, we describe how the Spannogram algorithm constructs the candidate supports in Sd and explain 
why this set has tractable size. We build up to the general algorithm by explaining special cases that are easier 
to understand. 



3.1 Rank-1 case 

Let us start with the rank 1 case, i.e., when d= 1. For this case 

Ai = XiVxvf. 

Assume, for now, that all the eigenvector entries are unique. This simplifies tie-breaking issues that are for- 
mally addressed by a perturbation lemma in our Appendix. For the rank-1 matrix A\, a simple thresholding 
procedure solves sparse PCA: simply keep the k largest entries of the eigenvector vy. Hence, in this simple case 
Si consists of only 1 set. 

To show this, we can rewrite Q as 

2 ( n \ 2 

maxi T A\x = Ai • max (vTx) — Ai • max > vuXi , (3) 

x£S k x£S k y ' x£§ k I •f-j' J 

where is the set of all vectors x € R" with ||a;||2 = 1 and ||x||o = k. We are trying to find a fc-sparse vector 
x that maximizes the inner product with a given vector v\ . It is not hard to see that this problem is solved 
by sorting the absolute elements of the eigenvector v\ and keeping the support of the k entries in vi with the 
largest amplitude. 

Definition 1. Let Ik(v) denote the set of indices of the top k largest absolute values of a vector v. 

We can conclude that for the rank-1 case, the optimal fc-sparse PC for Ai will simply be the fc-sparse vector 
that is co-linear to the fc-sparse vector induced on this unique candidate support. This will be the only rank-1 
candidate optimal support 

S 1 = {I h (v 1 )}. 

3.2 Rank-2 case 

Now we describe how to compute S 2 - This is the first nontrivial d which exhibits the details of the spannogram 
algorithm. We have the rank 2 matrix 

2 

A 2 = ^A^ T = v 2 y 2 T , 
i=i 
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where Va = [VAi ■ V\ ■ U2] ■ We can rewrite |T| on A 2 as 

maxx T A 2 x = max II T/^xll^. (4) 
In the rank-1 case we could write the quadratic form maximization as a simple maximization of a dot product 

maxx T Aix = max.(vTx) 2 . 

x£§ k x£S k V ' 

Similarly, we will prove that in the rank-2 case we can write 

maxx 7 1 A2X = max (vjx) , 

xGSfc " x£S k 

for some specific vector v c in the span of the eigenvectors 1)1,1)2', this will be very helpful in solving the problem 
efficiently. 

To see this, let c be a 2 x 1 unit length vector, i.e., ||c|| 2 = 1. Using the Cauchy-Schwartz inequality for the 

inner product of c and V 2 T x we obtain (c r V 2 T x) < || V 2 T x\\ 2 , where equality holds, if and only if, c is co-linear 
to V 2 T x. By the previous fact, we have a variational characterization of the ^ 2 - n orm: 



\V 2 T x\\ 2 2 = max (c T V 2 T x) . (5) 



We can use |5| to rewrite Q as 



max (c V 9 x) = max max (v r x) 

xeS*,||c||a=l V ' ||c|| 2 =l V ' 

= max max(vjx) 2 , (6) 

|| c || 2 =izes fc v > 

where v c = V 2 c. 

We would like to note two important facts here. The first is that for all unit vectors c, v c = V 2 c generates 
all vectors in the span of V 2 (up to scaling factors). The second fact is that if we fix c, then the maximization 

max^gs,, (wjx) 2 is a rank-1 instance, similar to (|3). Therefore, for each fixed unit vector c there will be one 
candidate support (denote it by I k {V 2 c)) to be added in S 2 . 

If we could collect all possible candidate supports Ik(V 2 c) in 

S2= |J {Ik(V2c)}, (7) 

ceR 2 >< 1 : ||c||2=i 

then we could solve exactly the sparse PC A problem on A 2 : we would simply need to test all locally optimal 
solutions obtained from each support in S 2 and keep the one with the maximum metric. The issue is that there 
are infinitely many v c vectors to check. Naively, one could think that all possible fc-supports could appear 
for some v c vector. The key combinatorial fact is that if a vector v c lives in a two dimensional subspace, there 
are tremendously fewer possible supports^] 

|S 2 |<4(^ 



Spherical variables. Here we use a transformation of our problem space into a 2-dimensional space. The 
transformation is performed through spherical variables that enable us to visualize the 2-dimensional span of 
V 2 . For the rank-2 case, we have a single phase variable <f> G 3> = (— ^, |-1 and use it to rewrite c, without loss 
of generality, as 

sine 

c = 

cos< 

3 This is a special case of our general d dimensional lemma (found in the Appendix), but we prove the special case to simplify the 
presentation. 
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Figure 1: A rank-2 spannogram for a V2 matrix with n = 3. 

which is again unit norm and for all 4> it scans alQ 2x1 unit vectors. Under this characterization, we can 
express v c in terms of <f> as 

v((j>) = V2C = sin 4> ■ yXiVi + cos (f> ■ 1/^2^2- (8) 
Observe that each element of v(cj>) is a continuous curve in <f>: 



v (<t l )}i — yAi^i sin(</>) + v^2«2 cos(0), 



2 



for alH = 1, . . . , n. Therefore, the support set of the k largest absolute elements of v(<j>) (i.e., lk(v(<fi))) is itself a 
function of (j>. 

The Spannogram. In Fig. [lj we draw an example plot of 3 (absolute) curves | [v(<f>)] ;|, i — 1,2,3, from a 
randomly generated matrix V2 ■ We call this a spannogram, because at each <f>, the values of curves correspond 
to the absolute values of the elements in the column span of V^- Computing [v((j>))i for all i,(f> is equivalent 
to computing the span of V 2 . From the spannogram in Fig. Ill we can see that the continuity of the curves 
implies a local invariance property of the support sets around a given <f>. That is, we expect that 

Ik(v(4>± e)) = Zk(v(<j))), for a sufficiently small e > 0. As a matter of fact, a support set lk(v((p)) changes, 
if and only if, the respective sorting of two absolute elements | [«(</>)]«! an d |[ u (0)]j| changes. Finding these 
interesection points | [v((j))]i \ = \ [v((j))]j | is the key to find all possible support sets. 

There are n curves and each pair intersects on exactly two pointsFl Therefore, there are exactly 2Q) in- 
tersection points. The intersection of two absolute curves are exactly two points 4> that are a solution to 
[v(4>)]i = [v(4>)]j and = — [v(<f>)]j. These are the only points where local support sets might change. These 

2 Q) intersection points partition $ in 2 Q) + 1 regions within which the top k support sets remain invariant. 

Building S2. To build S2, we need to i) determine all c intersection vectors that are defined at intersec- 
tion points on the <fi-axis and ii) compute all distinct locally optimal support sets Ik(v c ). To determine an 
intersection vector we need to solve all 2(") equations = ±[v((j>)]j for all pairs i,j € [n]. This yields 

[v((f>)]i = ±[v(4>)]j ej Vc = ±ejVc, that is 



fer±eJ)V"c = 0^c=nullspace((ef ±ej) V) . (9) 



4 Note that we restrict ourselves to (— w, 5 ], instead of the whole (— 7r, ir] angle region. First observe that the vectors in the complement 
of & are opposite to the ones evaluated on Omitting the opposite vectors poses no issue due to the squaring in jlj, i.e., vectors c and — c 
map to the same solutions. 

5 As we mentioned, we assume that the curves are in "general position," i.e., no three curves intersect at the same point and this can be 
enforced by a small perturbation argument presented in the Appendix. 
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Since c needs to be unit norm, we simply need to normalize the solution c. We will refer to the intersection 
vector calculated on the <fi of the intersection of two curves i and j as cfj and c~j, depending on the corre- 
sponding sign in ||9}. For the intersection vectors cf^ and c~j we compute Xk{V2cf j) and Xfc^c^). Observe 
that since the i and j curves are equal on the intersection points, there is no prevailing sorting among the two 
corresponding elements i and j of V^c^ or V^cT^. Hence, for each intersection vector cfj and c^-, we create 
two candidate support sets, one where element i is larger than j, and vice versa. This is done to secure that 
both support sets, left and right of the cf) of the intersection, are included in S2 ■ With the above methodology, 
we can compute all possible Ik (Vac) rank-2 optimal candidate sets and we obtain 

|5 2 |< 4 Q=0(n 2 ). 

The time complexity to build S2 is then equal to sorting Q) vectors and solving 2(") equations in the 2 
unknowns of cfj and cfj. That is, the total complexity is equal to (^jnlogn + ( r 2 l )2 2 = O (n 3 logn). 



Algorithm 2 Spannogram Algorithm for Sd- 



Input: k, p, V d = [y/X^Vi . . . V%>i] 
Initialize Sd 

B<-{1,...,1} 
if p = then 

B^{b 1 ,...,b d ^ 1 }e{±l} d ~ 1 

end if 

for all (™) subsets . . . , i d ) from {l,...,n} do 
for all sequences (61, . . . € ^ do 



c <— nullspace 



if p = 1 then 

I indices of the fc-top elements of Vc 
else 

X indices of the fc-top elements of abs(yc) 
end if 

I <- 1 
Ji ^- T-\-m 

r<-\JiC\ (ii,...,i d )\ 
if r < d then 

for all r-subsets M from . . . do 
/ + 1 

end for 
end if 

S d ^5 d U J1...U Ji. 
end for 
end for 
Output: Sd- 



Remark 1. The spannogram algorithm operates by simply solving systems of equations and sorting vectors. It is not 
iterative nor does it attempt to solve a convex optimization problem. Further, it computes solutions that are exactly 
k-sparse, where the desired sparsity can be set a-priori. 

The spannogram algorithm presented here is a subroutine that can be used to find the leading sparse PC 
of Ad in polynomial time. The general rank-<i case is given as Algorithm 2. The details of our algorithm, the 
elimination step, and tune-ups for matrices with nonnegative entries can be found in the Appendix. 
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4 Experimental Evaluation and Conclusions 



We now empirically evaluate the performance of our algorithm and compare it to the full regularization path 
greedy approach (FullPath) of ( jd'Aspremont et aT 2007b|, the generalized power method (GPower) of ( |Journee| 
|et al.}|2010| ), and th e truncated power method (TPower) of ( |Yuan & Zhang[|2011} . We omit the DSPCA semidef- 
inite approach of ( jd'Aspremont et aT 20073)1, sin ce the FullPath algorithm is experimentally shown to have 
similar or better performance ( jd'Aspremont et al. 2008) . 

We start with a synthetic experiment: we seek to estimate the support of the first two sparse eigenvectors 
of a co variance matrix from sample vectors. We continue with testing our algorithm on gene expression data 
sets. Finally we run experiments on a large-scale document-term data set, comprising of millions of Twitter 
posts. 



4.1 Spiked Covariance Recovery 

We first test our approximation algorithm on an artificial data set generated in the same manner as in (jShen 



& Huang 



2008 



Yuan & Zhang 2011[ |. We consider a covariance matrix E, which has two sparse eigenvectors 



with very large eigenvalues and the rest of the eigenvectors correspond to small eigenvalues. Here, we consider 
E = Ya=i ^i v i v I with Ai = 400, A 2 = 300, A3 = 1, ... , A 50 o = 1. where the first two eigenvectors are sparse and 
each has 10 nonzero entries and non-overlapping supports. The remaining eigenvectors are picked as n — 2 
orthogonal vectors in the nullspace of [vi V2\- 

We have two sets of experiments, one for few samples and one for extremely few. First, we generate 
to = 50 samples of length n = 500 distributed as zero mean Gaussian with covariance matrix E and repeat 
the experiment 5000 times. We repeat the same experiment for to = 5. We compare our rank-1 and rank-2 
algorithms against FullPath, GPower with l\ penalization and £q penalization, and TPower. After estimating 
the first eigenvector with v\, we deflate A to obtain A . We use the projection deflation method (Mackey 2009) 
to obtain A' = (/- v x v()A(I 



v x vj) 



and work on it to obtain v%, the second estimated eigenvector of E. 



In Table 1, we report the probability of correctly recovering the supports of v x and v 2 - if both estimates vi 
and V2 have matching supports with the true eigenvectors, then the recovery is considered successful. 







500 x 50 


500 x 5 




k 


Prec. 


Prec. 


PCA+thresh. 


10 


.98 


0.85 


GPower-^ (7 = 0.8) 


10 


1 


0.33 


GPower-^i (7 = 0.8) 


10 


1 


0.33 


FullPath 


10 


1 


0.96 


TPower 


10 


1 


0.96 


Rank-2 approx. 


10 


1 


0.96 



Table 1 : Performance results on the spiked covariance model, where p lec . represents the recovery probability of 
the correct supports of the two sparse eigenvectors of E. 

In our experiments for m = 50, all algorithms were comparable and performed near-optimally, apart from 
the rank-1 approximation (PCA+thresholding). The success of our rank-2 algorithm can be in parts suggested 
by the fact that the true covariance E is almost rank 2: it has very large decay between its 2nd and 3rd eigen- 
value. The average approximation guarantee that we obtained from the generating experiments for the rank 
2 case and for to = 50 was x\Ax-i > 0.7 • x*Ax£ , that is before running our algorithm, we know that it could 
on average perform at least 70% as good as the optimal solution. For m = 5 samples we observe that the per- 
formance of the rank-1 and GPower methods decay and FullPath, TPower, and rank-2 find the correct support 
with probability approximately equal to 96%. This overall decay in performance of all schemes is due to the 
fact that 5 samples are not sufficient for a perfect estimate. Interesting tradeoffs of sample complexity and 
probability of recovery where derived in (Amini & Wainwright, 2008). Conducting a theoretical analysis for 
our scheme under the spiked covariance model is left as an interesting future direction. 
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4.2 Gene Expression Data Set 
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Figure 2: Results on gene expression data sets. 

In the same manner as in the relevant sparse PCA literature, we evaluate our approximation on two gene 
expression data-sets used in ( jd'Aspremont et aL |2007b{ [2008{ |Yuan & Zhang 2011} . We plot the ratio of the 
explained variance coming from the first sparse PC to the explained variance of the first eigenvector (which is 
equal to the first eigenvalue). We also plot the performance outer bound derived in (d'Aspremont et al. 20081. 
We observe that our approximation follows the same optimality pattern as most previous methods, for many 
values of sparsity k. In these experiments we did not test the GPower method since the output sparsity cannot 
be explicitly predetermined. However, previous literature indicates that GPower is also near-optimal in this 
scenario. 



4.3 Large-scale Twitter data-set 

We proceed our experimental evaluation of our algorithm by testing it on a large-scale data set. Our data-set 
comprises of millions of tweets coming from Greek Twitter users. Each tweet corresponds to a list of words and 
has a character limit of 140 per tweet. Although each tweet was associated with metadata, such us hyperlinks, 
user id, hash tags etc., we strip these features out and just use the word list. We use a simple Python script 
to normalize each Tweet. Words that are not contextual (me , to , what , etc) are discarded in an ad-hoc way. 
We also discard all words that are less than three characters, or words that appear once in the corpus. We 
represent each tweet as a long vector consisting of n words, with a 1 whenever a word appears, and if it does 
not appear. Further details about our data set can be found in the Appendix. 

Document-term data sets have been observed to follow power-laws on their eigenvalues. Empirical results 
have been reported that indicate power-law like decays for eigenvalues where no c utoff is observed <|Dh illon 



& Modha 20011 and some derived power-law generative models for 0/1 matrices (Mihail & Papadirnitriou, 
2002 ; |Chung et aL) 2003} . In our experiments, we also observe power-law decays on the spectrum of the twitter 



matrices. Further experimental observations of power laws can be found in the Appendix. These underlying 
decay laws on the spectrum were sufficient to give good approximation guarantees; for many of our data sets 
1 — e was between 0.5 to 0.7, even for d = 2,3. Further, our algorithm empirically performed better than these 
guarantees. 

In the following tests, we compare against TPower and FullPath. TPower is run for lOfc iterations, and is 
initialized with a vector having Is on the k words of highest variance. For FullPath we restrict the covariance to 
its first 5k words of highest variance, since for larger numbers the algorithm became slow to test on a personal 
desktop computer. In our experiments, we use a simpler deflation method, than the more sophisticated ones 
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Table 2: The first 5 sparse PCs for a data-set consisting of 65k Tweets and 64k unique words. Words that 
appear with a G are translated from Greek. 



11 



used before. Once k words appear in the first fc-sparse PC, we strip them from the data set, recompute the 
new convariance, and then run all algorithms. A benefit of this deflation is that it forces all sparse PCs to be 
orthogonal to each other which helps for a more fair comparison with respect to explained variance. Moreover, 
this deflation preserves the sparsity of the matrix A after each deflation step; sparsity on A facilitates faster 

execution times for all methods tested. The performance metric here is again the explained variance over its 

V>£ X T A X . 

maximum possible value: if we compute L PCs, xx, ■ ■ ■ , xl, we measure their performance as ' A — -. We 

see that in many experiments, we come very close to the optimal value of 1. 

In Table [5J we show our results for all tweets that contain the word Japan, for a 5-day (May 1-5, 2011) 
and then a month-length time window (May, 2011). In all these tests, our rank-3 approximation consistently 
captured more variance than all other compared methods. 

In Table |5J we show a day-length experiment (May 10th, 2011), where we had 65k Tweets and 64k unique 
words. For this data-set we report the first 5 sparse PCs generated by all methods tested. The average compu- 
tation times for this time-window where less than 1 second for the rank-1 approximation, less than 5 seconds 
for rank-2, and less than 2 minutes for the rank-3 approximation on a Macbook Pro 5.1 running MATLAB 7. 
The main reason for these tractable running times is the use of our elimination scheme which left only around 
40 — 80 rows of the initial matrix of 64k rows. In terms of running speed, we empirically observed that our 
algorithm is slower than Tpower but faster than FullPath for the values of d tested. In Table |2j words with 
strike-through are what we consider non-matching to the "main topic" of that PC. Words marked with G are 
translated from Greek. From the PCs we see that the main topics are about Skype's acquisition by Microsoft, 
the European Music Contest "Eurovision," a crime that occurred in the downtown of Athens, and the Greek 
census that was carried for the year 2011. An interesting observation is that a general "excitement" sparse 
principal component appeared in most of our queries on the Twitter data set. It involves words like like, 
love, liked, received, great, etc, and was generated by all algorithms. 

We conclude that our algorithm can efficiently provide interpretable sparse PCs while matching or out- 
performing the accuracy of previous methods. A parallel implementation in the MapReduce framework and 
larger data studies are very interesting future directions. 
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Table 3: Performance comparison on the Twitter data-set 



Appendix 

A Rank-d candidate optimal sets Sd 

In this section, we generalize the concepts of the S2 construction to the general d case and prove the following: 
Lemma 1. The rank-d optimal set Sd has 0(n d ) candidate optimal solutions and can be build in time 0(n d+1 logn). 
Here, we use Ad = VdVj where Vd — U/Xx ■ v\ ■ ■ ■ V~^d • Vd] ■ We can rewrite the optimization on Ad as 

vi\ax-X T AdX = max IIkTxIL = max (c T VTx) = max (v'Fx) , (10) 
xes k 11 a 112 ^GS fc ,Hc|| 2 =i v a ' xeS k ,\\c\\ 2 =l K c ' 

where v c = V d c. Again, for a fixed unit vector c, T k (v c ) is the locally optimal support set of the /c-sparse vector 
that maximizes (vj'x) . 
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Hyperspherical variables and intersections. For this case, we introduce d— 1 angles ip = [<f>i,. 



7T 7T 

" 2 ' 2 



] rf 1 which are used to restate c as a hyperspherical unit vector 



sin 0i 
cos 4>i sin 4>2 

cos 4>i cos 4>2 ■ ■ ■ sin 4>d- 1 
cos </>i cos 4>2 ■ ■ ■ cos (f>d- 1 



In this case, an element of v c = VdC is a continuous d-dimensional function on rf — 1 variables ^ = [0i , 
i.e., it is a (d — 1) -dimensional hypersurface: 



[v(ip)]i = sin0i ■ [V-Mili + cos 0i sin 02 ■ [V ^2^2]* + ■ • • + cos0icos0 2 ■ • .cos0 d _i[V \dVd]i- 
In Fig. |3j we draw a d = 3 example spannogram of 4 curves, from a randomly generated matrix V3. 



f>d-i e 



«d-i 




Figure 3: A rank-3 spannogram for a 3 x 3 matrix V3. 



Calculating Ik{v c ) for a fixed vector c is equivalent to finding the relative sortings of the n (absolute value) 
hypersurfaces |[i>(y)]t| at that point. The relative sorting between two surfaces and [i> (<£>)] j changes on 

the point where [w(</5)]i = [^(v 3 )]^- A difference with the rank-2 case, is that here, the solution to the equation 
[u(iy9)]i = [v(ip)] j is not a single point (i.e., a single solution vector c), but a (d— 1) dimensional space of solutions. 
Let 

x i,j = {«(</?) : [v(<p)]i = [v(<p)]j,<pe I 

be the the set of all v(ip) vectors where [i>(y>)]i = [u(y>)]j in the (p domain. Then, the elements of the set Xi j 
correspond exactly to the intersection points between hypersurfaces i and j. 
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Since locally optimal support sets change only with the local sorting changes, the intersection points de- 
fined by all Xi j sets are the only points of interest. 

Establishing all intersection vectors. We will now work on the (d — 2) -dimensional space Xi.j. For the 
vectors in this space, there are again critical (p points, where both the i and j coordinates become members 
of a top-/c set. These are the points when locally optimal support sets change. This happens when the i 
and j coordinates become equal with another l-th coordinate of v{(j>). This new space of v(<f>) vectors where 
coordinates i,j,and I are equal will be the set Xij k, this will now be a (d— 3) -dimensional subspace. These 
intersection points can be used to generate all locally optimal support sets. From the previous set we need to 
only check points where the three curves studied intersect with an additional one that is the bottom curve of 
a top k set. Again, these intersection points are sufficient to generate all locally optimal support sets. We can 
proceed in that manner until we reach the single-element set A"^ ^ ; d which corresponds to the vector v((j>) 
defined over $ , where d curves intersect, i.e., 



NyOk = H<p)]i 3 = ■■■ = [v((p)]i d . 



(ii) 



Observe that for each of these d curves we need to check both of their signs, that is all equations [v(ip)]i 1 = 
bi[v(ip)]i 2 = . . . = bd-i[v(<p)]i d need to be solved for c, where bi £ {±1}- Therefore, the total number of inter- 
section vectors c is equal to 2 d ~ 1 Q). These are the only vectors that need to be checked, since they are the 
potential ones where d-tuples of curves may become part of the top k support set. 

Building Sj. To visit all possible sorting candidates, we now have to check the intersection points, which 
are obtained by solving the system of d — 1, linear in c(<^i :( j-i), equations 



[v((p)] h = bi [v((f)] i2 = ... = b d _ 1 [v((p)] id 
^Kv)]u =&iK¥0k>--->Kp)k = bd-i[v(ip)]i d 



where bi £ {±1}- Then, we can rewrite the above as 



Vc = 



(d-l)xl 



(12) 



(13) 



where the solution is the nullspace of the matrix multiplying c, which has dimension 1 . 

To explore all possible candidate vectors, we need to compute all possible 2 d-1 Q) solution intersection 
vectors c. On each intersection vector we need to compute the locally optimal support set Ik(v c ). Then, due 
to the fact that the i\, . . . coordinates of v c have the same absolute value, we need to compute, on c, at 
most (|"<f"|) tuples of elements that are potentially in the boundary of the bottom k elements. This is done to 
secure that all candidate support sets "neighboring" an intersection point are included in Sd- This, induces at 
most (pi) local sortings, i.e., rank-1 instances. All these sorting will eventually be the elements of the Sd set. 

The number of all candidate support sets will now be 2 d ~ 1 (r<n) Q) and the total computation complexity is 

0(n d+l loKn). 



B Nonnegative matrix speed-up 

In this section we show that if A has nonnegative entries then we can speed up computations by a factor of 
2 d ~ 1 . The main idea behind this speed-up is that when A has only nonnegative entries, then in our intersection 
equations in Eq. {13) we do not need to check all possible signed combinations of the d curves. In the following 
we explain this point. 

We first note that the Perron-Frobenius theorem ( Horn & Johnson 1990) grants us the fact that the opti- 



mal solution will have nonnegative entries. That is, if A has nonnegative entries, then x* will also have 
nonnegative entries. This allows us to pose a redundant nonnegativity constraint on our optimization 

maxx T Ax = max x T Ax. (14) 
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Figure 4: An example of a spannogram for n = 6, d = 2. Assume that k = 1. Then, the candidate optimal 
supports are 6>2 = {{1}>{2}}, that is either the blue curve (i = 1) is the top one, or the green curve (i = 2) 
becomes the top one, depending on the different values of <j>. Finding the intersection points between these 
two curves is sufficient to recover these optimal supports. The idea of the elimination is that curves with 
(maximum) amplitude less than the amplitude of these types of intersection points can be safely discarded. In 
our example, after considering the blue and green curves and obtaining their intersection points, we can see 
that all other curves apart from the purple curve can be discarded; their amplitudes are less than the lowest 
intersection point of the blue and green curves. Our elimination step formalizes this idea. 

Our approximation uses the above constraint to reduce the cardinality of Sd by a factor of 2 <i ~ 1 . Let us consider 
for example the rank 1 case: 



Here,the optimal solution can be again found in time O(nlogn). First, we sort the elements of v. The optimal 
support I the for above problem corresponds to either the top k, or the bottom k unsigned elements of the 
sorted v. The fact that is important here is that the optimal vector can only have entries of the same signj^The 
implication of the previous fact is that on our curve intersection points, we can only account for intersections 
of the sort [u(<^)]j = [v((p)]j. Intersection of the form = — [u(<p)]j are not to be considered due to the fact 

that the locally optimal vector can only have one of the two signs. This means that in Eq. (13} , we only have a 
single sign pattern. This eliminates exactly a factor of 2 d ~ 1 from the cardinality of the Sd set. 



In this section we present our feature elimination algorithm. This step reduces the dimension n of the problem 
and this reduction in practice is empirically shown to be significantl and allows us to run our algorithm for 
very large matrices A. Our elimination algorithm is combinatorial and is based on sequentially checking the 
rows of Vd, depending on the value of their norm. This step is based again on the spannogram framework used 
in our approximation algorithm for sparse PCA. In Fig. 4j we sketch the main idea of our elimination step. 

The essentials of the elimination. First note that a locally optimal support set Xk{Vc) for a fixed c in fit)) , 
corresponds to the top k elements of v c . As we mentioned before, all elements of v c correspond to hypersurfaces 
| [v (<p)]i | that are functions of the d — 1 spherical variables in tp. For a fixed tp € <£ >d ~ 1 , the candidate support set 
corresponds exactly to the top k (in absolute value) elements in v c — v(tp), or the top k surfaces | [u(</?)]i| for that 
particular tp. There is a very simple observation here: a surface belongs to the set of top k surfaces if 

6 If there are less than k elements of the same sign in either of the two support sets in T\, then, and in order to satisfy the sparsity 
constraint, we can put weight e > on elements with the least amplitude in such set and opposite sign. This will only perturb the 
objective by a component proportional to e, which can then be driven arbitrarily close to 0, while respecting the sparsity constraint. 




C Feature Elimination 
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(a) Start with the curves of highest amplitude. Then, 
find their intersection points (red dots) and put them 
in the set V\ ■ 















- 


2 
1 
1 
1 


• intersection points 
— >W)|3 


/ 

/ / 

/ / 
/ / 
/ / 
/ / 
/ / 


N 










12 




s. / / 
\ / / 




10 

8 


amplitude of curve v . 
is larger than lowest 
intersection point 


V / 

/ x \ / * 




6 


\ 


1 \ 1 / 
f \ / ■ 

V / ' 




4 


* / 
\ / 


\\ / /' 




2 


\ / 
\ / 
\ / 
\ / 


\ \ / / 

\ / \ / 






1.5 -1 -0.5 


0.5 1 


1.5 



(b) Examine the curve with amplitude that is the largest 
among the ones not tested yet. If the curve has amplitude 
less than the minimum intersection point in V\, discard it. 
Also, discard all curves with amplitude less than that. If it 
has amplitude higher than the minimum point in V\ , then 
compute its intersection points with the curves already ex- 
amined. For each new intersection point check whether it 
has ft — 1 curves above it. If yes, add it to V\. Retest all 
points in V\; if there is a point that has more than k — 1 
curves above it, discard it from V\ . 




-1.5 -1 -0.5 0.5 1 1.5 



(c) Repeat the same steps. Check if the amplitude of the 
lowest intersection point is higher than the amplitude of the 
curve next in the sorted list. 




-1.5 -1 -0.5 0.5 1 1.5 



(d) Eventually this process will end by finding a curve with 
amplitude less than the intersection points in Vi ■ It will then 
discard all curves with amplitude less, as shown in the figure 
above. 



Figure 5: A simple elimination example for n = 6, d = 2, and k = 1. 
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is below at most k — 1 other surfaces on that ip. If it is below k surfaces at that point ip, then 
does not belong in the set of k top surfaces. 

A second key observation is the following: the only points <p that we need to check are the critical inter- 
section points between d surfaces. For example, we could construct a set 34 of all intersection points of all d 
sets of curves, such that for any point in this set the number of curves above it is at least k — 1. In other words, 
34 defines a boundary: if a curve is above this boundary then it may become a top k curve; if not it can never 
appear in a candidate set. This means that we could test each curve against the points in 34 an d discard it 
if its amplitude is less than the amplitudes of all intersection points in 34- However, the above elimination 
technique implies that we would need to calculate all intersection points on the n surfaces. Our goal is to use 
the above idea by serially checking one by one the intersection points of high amplitude curves. 

Elimination algorithm description. We use the above ideas, to build our elimination algorithm. We first 
compute the norms of each row |j [Vd] : ,i||2 of Vd- This norm corresponds to the amplitude of [v(<p)]i. Then, we 
sort all n rows according to their norms. We first start with the k + d rows of Vd (i.e., surfaces) of highest norm 
(i.e., amplitude) and compute their ( ~y) intersection points. For each intersection point, say <p, we compute 
the number of | surfaces above it. If there are less than k — 1 surfaces above an intersection point, then 

this means that such point is a potential intersection point where a new curve enters a local top k set. We keep 
all these points in a set 74- 

We then move to the (k + d + l)-st surface of highest amplitude; we test it against the minimum amplitude 
point in 74. If the amplitude of the (k + d + l)-st surface is less than the minimum amplitude point in 74, then 
we can safely eliminate this surface (i.e., this row of Vd), and all surfaces with maximum amplitude smaller 
than that (i.e., all rows of Vd with norm smaller than the row of interest). If its amplitude is larger than the 
amplitude of this point, then we compute the new set of ( fe+ ^ +1 ) intersection points obtained by adding this 
new surface. We check if some of these can be added in 74/ using the test of whether there are at most k — 1 
curves above each point. We need also re-check all previous points in 74/ since some may no longer be eligible 
to be in the set; if some are not, then we delete them from the set 74. We then move on the next row of Vd, and 
continue this process until we reach a row with norm less than the minimum amplitude of the points in 74 • 

A pseudo-code for our feature elimination algorithm can be found as Algorithm 1. In Fig. |5j we give an 
example of how our elimination works. 



D Approximation Guarantees 

In this section, we prove the approximation guarantees for our algorithm. Let us define two quantities, namely 

OPT = maxx T Ax and OPT^ = max a; 7 AdX, 

which correspond to the optimal values of the initial maximization under the full-rank matrix A and its rank-<i 
approximation Ad, respectively. Then, we establish the following lemma. 

Lemma 2. Our approximation factor is lower bounded as 

maxA(A x ) QpTd 

Pd= A W -OPT- (15) 

Proof. The first technical fact that we use is that an optimizer vector Xd for Ad (i.e., the one with the maximum 
performance for Ad), can achieve at least the same performance for A, i.e., xjAxd > x^AdXd- The proof is 
straightforward: since A is PSD, each quadratic form inside the sum J>2? = i \x T vivj x is a positive number. 
Hence, YH=\ \x T VivJx > Yli=i Kx T Vivfx, for any vector x and any d. 

The second technical fact is that if we are given a vector Xd with nonzero support X, then calculating qd, the 
principal eigenvector of Ax, results in a solution for A with better performance compared to Xd- To show that, 
we first rewrite Xd as Xd = PxXd, where P% is an n x n matrix that has Is on the diagonal elements that multiply 
the nonzero support of Xd and has Os elsewhere. Then, we have 

OPT d < x T d Ax d = x T d PxAPxXd = x T d AxXd (16) 

< max x T A x x = qjA x qd = q^Md- 

IM|2 = 1 
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Algorithm 3 Elimination Algorithm. 



f\[vi . . . VAi 



Input: k, p, V d - 
Initialize V k <- 

Sort the rows of Vd in descending order according to their norms ||e| 

for all Q) subsets (ii, . . . , «d) from {1, . . . 
for all sequences . . . , bd-i) € S do 



^11- 



i} do 



nullspace 



v d 



e T — ■ e 1 
n a 1 »„ 

if there are fc — 1 elements of |u c | greater than |eiVdc| then 
end if 

if ||Va+i,:|| < min{x e V k } then 

STOP ITERATIONS, 
end if 

n •<— n + 1 

for each element x in 7\, do 

check the elements |w c | greater than x 
if there are more than k — 1 then 

discard it 
end if 
end for 
end for 
end for 
Output: A d 



VdYj , where Vd comprises of the first n rows of Vd of highest norm. 



Using the above fact for all sets I € Sd, we obtain that maxA(^4i) > OPT^, which proves our lower bound. □ 

X^Sd 

Sparse spectral ratio. A basic quantity that is important in our approximation ratio as we see in the follow- 
ing, is what we define as the sparse spectral ratio, which is equal to Ad+i/A^. This ratio will be shown to be 
directly related to the (non-sparse) spectrum of A. 

Here we prove the the following lemma. 

Lemma 3. Our approximation ratio is lower bounded as follows. 

A i 

Proof. We first decompose the quadratic form in |TJ in two parts 

x 1 Ax —x T ^ iViV ?^J x = xT A-dX + x T Adcx (18) 



where Ad<= = A — Ad = J27=d+i ^i v i v f- Then, we take maximizations on both parts of 1 18 1 over our feasible set 
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of vectors with unity I2 -norm and cardinality k and obtain 

max x T Ax = max (x T A d x + x T A d ax) 

<=lmaxx T Ax < maxi T A,jci + nmxx T A^cx 

X£§k %£§k X£^k 

^OPT < OPT d + maxa; T A d cx 
( <3> } OPT < OPT d + max x T A d .x 

M\2 = l 

( W0FT<0FT d + X d+1 , (19) 

where (i) comes from the fact that the maximum of the sum of two quantities is always upper bounded by the 
sum of their maximum possible values, (ii) is due to the fact that we lift the £q constraint on the optimizing 
vector x, and (Hi) is due to the fact that the largest eigenvalue of A — A d is equal to \ d +i- Moreover, due to the 
fact that OPT > OPT d , we have 

OPT -X d+1 < OPT d < OPT. (20) 



Dividing both the terms of 1 20 1 with OPT yields 



( k) 

Lower-bounding Aj . We will now give two lower-bounds on OPT. 



Ad+1 - 1 Ad+1 < < < 1. (21) 



(fc) OPT ~ OPT 



□ 



Lemma 4. The sparse eigenvalue of A is lower bounded as 



A^maxj^A^}. (22) 

Proof. The second bound is straightforward: if we assume the feasible solution e max , being the column of the 
identity matrix which has a 1 in the same position as the maximum diagonal element of A, then we get 

OPT > eLxCeLx = . max A tl = A^ . (23) 

1— l,...,n 

The first bound for OPT will be obtained by examining the rank-1 optimal solution on A\. Observe that 

OPT > OPTx = m.Bxx T Axx 

= X 1 ma,x(v'[x) 2 . (24) 

Both vi and x have unit norm; this means that (vf x) 2 < 1. The optimal solution for this problem is to allocate 
all k nonzero elements of x on Xfc(ui): the top k absolute elements of vi. An optimal solution vector, will give 
a metric of (vfx) 2 — || [vi]x k ( Vl ) III- The norm of the k largest elements of v\ is then at least equal to - times the 
norm of v\. Therefore, we have 

OPT>max j^A^A^J. (25) 

□ 

The above lemmata can be combined to establish Theorem 1 . 
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E Resolving singularities 



In our algorithmic developments, we have made an assumption on the curves studied, i.e., on the rows of the 
V d matrix. This assumption was made so that tie-breaking cases are evaded, where more than d curves intersect 
in a single point in the d dimensional space $ d . Such a singularity is possible even for full-rank matrices V d 
and can produce enumerating issues in the generation of locally optimal candidate vectors that are obtained 
through the intersection equations: 



b d-iel 



V d c = 0<j-ixi- 



(26) 



d—lxn 



The above requirement can be formalized as: no system of equations of the following form has a nontrivial 
(i.e., nonzero) solution 



- b d-ief d 

b d-l e td+1 



V d c ^ dx i 



(27) 



dxn 



for all c 7^ and all possible d + 1 row indices ii, . . . , i d+ \ (where two indices cannot be the same). We show 
here that the above issues can be avoided by slightly perturbing the matrix V d . We will also show that this 
perturbation is not changing the approximation guarantees of our scheme, guaranteed that it is sufficiently 
small. We can thus rewrite our requirement as a full-rank assumption on the following matrices 



rank 



- b d -ief d 



\ 



Va 



(28) 



dxi 



for all i\ ^ 12 7^ • • • 7^ id- Observe that we can rewrite the above matrix as 



e£ ~ b d-i4 d 
s£ - &d-ief d+1 j 



Vd 



dxn 



[Vd^-hiVd]^, 




' [V d ] iu : ' 




h[Vd]i 2 , 




[Vd]^,~HVdV d+1 , _ 




[Vdk, 
. [Vd] H , _ 




bi[V d ] id , 
. bi[V d ]i d+1 , _ 





where R il is a rank-1 matrix. Observe that the rank of the above matrix depends on the ranks of both of its 
components and how the two subspaces interact. It should not be hard to see that we can add ad x d random 
matrix A = [<5i <5 2 . . . 5 d ] to the above matrix, so that R il + Gi 2 ,...,i d+1 + A is full-rank with probability 1. 

Let E d be an n x d matrix with entries that are uniformly distributed and bounded as \E it j \ < e. Instead of 
working on V d we will work on the perturbed matrix V d = V d + E d . Then, observe that for any matrix of the 



previous form R l% + G v 



! we now have R il + Gi 



1 + [E d ] tl -®l d xi+E l2 



t , where 



Ei. 



[ E d]i 2 , 

\ E d]i 3 , 



[Ed] 



,l d +l 



(29) 



is full rank. 



i + 



Conditioned on the randomness of [E d ]i l!: , the matrix R il + Gi 2i ...,i d+1 + [E d \ ily . ® l d xi + E, 
Then due to the fact that there are d random variables in [E d ] il _ : and d 2 random variable in + G 
[E d ]i lt -. <S> ldxi + Ei 2 ,,,^i d+1 , the latter matrix will be full-rank with probability 1 using a union bounding argu- 
ment. This means that all Q) submatrices of V d will be full rank, hence obtaining the property that no d + 1 
curves intersect in a single point in $ d . 
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Now we will show that this small perturbation does not change our metric of interest significantly. The 
following holds for any unit norm vector x 



x r {V d + E d ){V d + E d f x = x T V d V d r x + x 1 E d E^x + 2x 1 V d E d x > x' V^x + 2x r V d E d x 



> x T V d Vjx - 2\\Vjx\\ • \\Ejx\\ > x T V d Vjx - 2 V /A 1 • Xi(E d Ej) 

and 

\\(V d + E d fxf < ||\/ d T a; || 2 + 211^1111^11 + ||£j a: || 2 < \\Vlxf + 2^/\ 1 -X 1 (E d Ej) + \ 1 (E d Ej) 

< \\Vlx\\ 2 +^\ 1 -\ 1 {E d Ej). 

Combining the above we obtain the following bound 

x T V d Vjx - iJx^X^Ej) < \\(V d + E d ) T x\\ 2 < \\Vjx\\ 2 + iJx^X^E^). 



By the above, we can appropriately pic k e such that • X\{E d Ej) = o(l). An easy way to get a bound 

on is via the Gershgorin circle theorem < |Horn & Johnson 1990 >, which yields Xi(E d Ej) < nd ■ e 2 . Hence, an 
e < „ — \, works for our purpose. 

VAinalogn r r 

To summarize, in the above we show that there is an easy way to avoid singularities in our problem. Instead 
of solving the original rank-d problem on V d , we can instead solve it on V d + E d , with an E d random matrix 
with sufficiently small entries. This slight perturbation only incurs an error of at most in the objective, 
which asymptotically becomes zero as n increases. 

F Twitter data-set description 

In Table |4j we give an overview of our Twitter data set. 



Data-set Specifications 



Geography 


mostly Greece 


time- window 


January 1-August 20, 2011 


unique user IDs 


~ 120k 


size 


~10 million entries 


tweets /month 


~ 1.5 million 


tweets /day 


~ 70k 


tweets/hour 


~3k 


words /tweet 


~ 5 


character limit/ tweet 


140 



Table 4: Data-set specifications. 



El Power Laws 

In this subsection we provide empirical evidence that our tested data-sets exhibit a power law decay on their 
spectrum We report these observations as a proof of concept for our approximation guarantees. Based on the 
spectrum of some subsets of our data-set, we provide the exact approximation guarantees derived using our 
bounds. 

In Fig. |6j we plot the best fit power law for the spectrum and degrees with data-set parameters given 
on the figures. The plots that we provide are for hour-length, day-length, and month-length analysis, and 
subsets of our data set based on a specific query. We observe that for all these subsets of our data set, the 
spectrum indeed follows a power-law. An interesting observation is that a very similar power law is followed 
by the degrees of the terms in the data set. This finding is compatible to the generative models and analysis 
of < |Mihail & Papadimitriou 2002 Chung et al. 2003) . The rough overview is that eigenvalues of A can be 
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(a) The plots that we provide are for hour-length, day-length, and (b) Approximation Guarantees: we show how the approximation 
month-length analysis, and subsets of our data set based on a spe- guarantees for these specific subsets of our data set scales with d. 
cific query. 

Figure 6: Power laws and approximation guarantees 

well approximated using the diagonal elements of A. In the same figure, we show how our approximation 
guarantees that based on the spectrum of A scales with d, for the various data-sets tested. We only plot for d 
up to 5, since for any larger d our algorithm becomes impractical for moderately large small data sets. 
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